WHAT IS CLAIMED IS:

1. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
(a) storing representations of data set addresses in a set of data structures, including a buffer and a first disk file, wherein representations of data set addresses stored in the first disk file are ordered;
(b) downloading at least one data set that includes addresses of one or more referred data sets;
(c) identifying the addresses of the one or more referred data sets;
(d) for each identified address:
(d1) generating a representation of the identified address;
(d2) determining whether the representation is stored in the buffer, and when this determination is negative, storing the representation in the buffer; and
(e) when the buffer reaches a predefined full condition:
(e1) ordering the contents of the buffer according to the representations; and
(e2) performing an ordered merge of the contents of the buffer into the contents of the first disk file.
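For illustration, steps (d2), (e1) and (e2) of claim 1 might be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the buffer is modeled as an in-memory set, the first disk file as a sorted list, and the function names `filter_address` and `merge_buffer` are hypothetical.

```python
def filter_address(rep, buffer):
    """Step (d2): store rep in the buffer unless it is already there."""
    if rep in buffer:
        return False
    buffer.add(rep)
    return True


def merge_buffer(buffer, disk_file):
    """Steps (e1) and (e2): order the buffer, then merge it into disk_file.

    disk_file is a sorted list standing in for the first disk file
    (an assumption of this sketch). Returns the merged list and the
    representations that were not already present in disk_file.
    """
    ordered = sorted(buffer)                      # step (e1)
    merged, new_items = [], []
    i = 0
    for rep in ordered:                           # step (e2): one ordered pass
        # Copy every disk entry that sorts before the buffer entry.
        while i < len(disk_file) and disk_file[i] < rep:
            merged.append(disk_file[i])
            i += 1
        if i < len(disk_file) and disk_file[i] == rep:
            continue                              # already on disk; copied later
        merged.append(rep)
        new_items.append(rep)
    merged.extend(disk_file[i:])                  # remaining disk entries
    buffer.clear()
    return merged, new_items
```

Because both inputs are ordered, the merge reads the disk file once from start to end instead of performing one random lookup per address.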

2. The method of claim 1, further comprising:
in step (d2), when the determination is negative, storing the identified address in the buffer.

3. The method of claim 1, further comprising:
in step (d2), when the determination is negative, storing the identified address in a second disk file;
in step (d2), additionally storing with each representation in the buffer a pointer to the corresponding address stored in the second disk file; and
in step (e1), while ordering the contents of the buffer, keeping with each representation in the buffer its pointer to the corresponding address in the second disk file.
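The pointer bookkeeping of claim 3 might look like the following Python sketch. It rests on several assumptions that are not in the claims: the second disk file is modeled by any seekable binary file object, the pointer is a byte offset, the representation is a CRC-32 value, and the helper names are hypothetical.

```python
import io
import zlib


def store_address(address, buffer, address_file):
    """Step (d2): on a buffer miss, append the address to the second disk
    file and keep its byte offset next to the representation."""
    rep = zlib.crc32(address.encode("utf-8"))     # assumed representation
    if rep in buffer:
        return False
    address_file.seek(0, io.SEEK_END)
    offset = address_file.tell()                  # pointer into second file
    address_file.write(address.encode("utf-8") + b"\n")
    buffer[rep] = offset
    return True


def ordered_entries(buffer):
    """Step (e1): order by representation; each pointer travels with it."""
    return sorted(buffer.items())


def read_address(address_file, offset):
    """Follow a pointer back to the full address."""
    address_file.seek(offset)
    return address_file.readline().rstrip(b"\n").decode("utf-8")
```

Keeping only a fixed-size pointer beside each representation keeps buffer entries small, while the full address remains recoverable after sorting.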
4. The method of claim 3 wherein:
step (e2) includes: for each representation in the buffer storing an associated flag, setting the flag to a first value when the representation is equal to a representation previously stored in the first disk file, and setting the flag to a second value, distinct from the first value, when the representation is not equal to any representation previously stored in the first disk file; and
step (e) includes: (e3) for each representation whose flag is set to the second value, scheduling the corresponding data set for downloading.

5. The method of claim 1 wherein:
step (a), storing representations of data set addresses, includes the step of storing representations of data set addresses in a sparse disk file which is divided into portions, each portion having a starting address and contents comprising an ordered list of representations of data addresses; and
step (e2), merging the contents of the buffer with the ordered contents of the sparse disk file, includes:
for each of a plurality of the representations stored in the buffer:
(e2-1) determining a starting address for a corresponding portion of the sparse disk file; and
(e2-2) performing an ordered merge of a subset of the buffer, starting at the representation for which the starting address was obtained, into the contents of the corresponding portion.



6. The method of claim 1 wherein:
step (a), storing representations of data set addresses, includes the step of storing representations of data set addresses in a sparse disk file having empty entries interspersed among entries storing said representations; and
step (e2), merging the contents of the buffer with the ordered contents of the sparse disk file, includes:
for each respective representation stored in the buffer:
(e2-1) determining a starting address for a corresponding portion of the sparse disk file; and
(e2-2) sequentially scanning the disk file, starting at the representation for which the starting address was obtained, until the first of (A) a representation matching the respective representation is found and (B) one of the empty entries is found, and when an empty entry is found storing the respective representation in the empty entry.
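The scan of steps (e2-1) and (e2-2) in claim 6 might be sketched in Python as below. The sketch assumes the sparse disk file is a fixed-size list in which `None` marks an empty entry, and it assumes a proportional rule for the starting address; neither the rule nor the name `insert_sparse` comes from the claims, and the claims do not say what happens when the scan runs off the end of the file.

```python
def insert_sparse(table, rep, max_rep):
    """Scan from a starting address until a match or an empty entry.

    Returns False when rep is already stored, True when it was written
    into the first empty entry found.
    """
    # Step (e2-1): hypothetical starting-address rule that maps small
    # representations to the front of the file and large ones to the back.
    start = rep * len(table) // (max_rep + 1)
    # Step (e2-2): sequential scan from the starting address.
    for pos in range(start, len(table)):
        if table[pos] == rep:
            return False                 # (A) matching representation found
        if table[pos] is None:
            table[pos] = rep             # (B) empty entry found: store here
            return True
    # Assumption of this sketch only: report failure at end of file.
    raise OverflowError("no empty entry after the starting address")
```

For example, with an eight-entry table and `max_rep = 15`, representations 4 and 5 both start at position 2, so 5 is stored one entry later at position 3.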

7. The method of claim 1 wherein, in step (d1), the representation comprises a checksum of at least a portion of the identified address.
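One possible checksum-based representation for claim 7 is sketched below in Python. The choice of SHA-1 truncated to 64 bits is an assumption made for the example; the claims do not name a checksum function, and `representation` is a hypothetical name.

```python
import hashlib


def representation(address: str) -> int:
    """Return a fixed-size checksum of the address as an integer.

    SHA-1 truncated to 8 bytes is an illustrative choice only.
    """
    digest = hashlib.sha1(address.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")
```

A fixed-size integer is cheaper to store, compare and sort than a variable-length address, at the cost of a small probability that two distinct addresses share a representation.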
8. The method of claim 1 wherein step (d2) further comprises:
(d2-1) determining whether the representation is stored in a cache before determining whether the representation is stored in the buffer;
(d2-2) when the representation is not stored in the cache, the cache has not reached a predefined full condition, and other predefined criteria are met, adding the representation to the cache; and
(d2-3) when the representation is not stored in the cache, the cache has reached said predefined full condition, and said other predefined criteria are met, evicting a stored representation from the cache in accordance with an eviction policy and adding the representation to the cache.
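The cache of claim 8 might be sketched in Python as follows. Least-recently-used eviction and the `admit` predicate (standing in for the "other predefined criteria") are assumptions of this sketch; the claims leave both the eviction policy and the criteria open, and the class name is hypothetical.

```python
from collections import OrderedDict


class RepresentationCache:
    """Small cache consulted before the buffer, with LRU eviction."""

    def __init__(self, capacity, admit=lambda rep: True):
        self.capacity = capacity          # the "predefined full condition"
        self.admit = admit                # stand-in for the other criteria
        self.entries = OrderedDict()

    def check_and_add(self, rep):
        """Return True on a cache hit; otherwise try to add rep."""
        if rep in self.entries:           # step (d2-1)
            self.entries.move_to_end(rep)
            return True
        if self.admit(rep):
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # step (d2-3): evict oldest
            self.entries[rep] = None               # steps (d2-2) and (d2-3)
        return False
```

A hit lets the caller skip the buffer lookup entirely, which helps when the same addresses recur frequently across downloaded data sets.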

9. The method of claim 1 wherein step (e2) further comprises: 

when a representation in the first buffer is not found in the first disk file during merging, scheduling the corresponding data set for downloading.

10. The method of claim 8 wherein step (e2) further comprises:
when a representation in the buffer is not found in the first disk file during merging, scheduling the corresponding data set for downloading.

11. The method of claim 8 wherein:
step (a), storing representations of data set addresses, includes the step of storing representations of data set addresses in a sparse disk file which is divided into portions, each portion having a starting address and contents comprising an ordered list of representations of data addresses; and
step (e2), performing an ordered merge of the contents of the buffer into the contents of the sparse disk file, includes:
for each of a plurality of the representations stored in the buffer:
(e2-1) obtaining a starting address for a corresponding portion of the sparse disk file; and
(e2-2) performing an ordered merge of a subset of the buffer, starting at the representation for which the starting address was obtained, into the contents of the corresponding portion.

12. The method of claim 8 wherein:
step (a), storing representations of data set addresses, includes the step of storing representations of data set addresses in a sparse disk file having empty entries interspersed among entries storing said representations; and
step (e2), merging the contents of the buffer with the ordered contents of the sparse disk file, includes:
for each respective representation stored in the buffer:
(e2-1) determining a starting address for a corresponding portion of the sparse disk file; and
(e2-2) sequentially scanning the disk file, starting at the representation for which the starting address was obtained, until the first of (A) a representation matching the respective representation is found and (B) one of the empty entries is found, and when an empty entry is found storing the respective representation in the empty entry.



13. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
(a) storing representations of data set addresses in a set of data structures, including a first buffer, a second buffer, and a first disk file, wherein the first disk file contains ordered representations of data set addresses;
(b) selecting as a current buffer one of the first and second buffers;
(c) downloading at least one data set that includes addresses of one or more referred data sets;
(d) identifying the addresses of the one or more referred data sets; and
(e) for each identified address:
(e1) generating a representation of the identified address; and
(e2) determining whether the representation is stored in the current buffer, and when this determination is negative, storing the representation in the current buffer; and
(f) when the current buffer reaches a predefined full condition:
(f1) selecting the other buffer as the current buffer, wherein the previously current buffer is identified as a non-current buffer;
(f2) ordering the representations stored in the non-current buffer; and
(f3) performing an ordered merge of the contents of the non-current buffer into the contents of the first disk file.
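The buffer swap of steps (f1) to (f3) in claim 13 might be sketched in Python as follows. The sketch is synchronous for simplicity, whereas a real crawler would presumably merge the non-current buffer in the background while the current buffer keeps filling. The class name, the fixed `capacity` threshold and the sorted list standing in for the first disk file are all assumptions.

```python
import heapq


class DoubleBuffer:
    """Two buffers: one accepts new representations, one is being merged."""

    def __init__(self, capacity):
        self.capacity = capacity          # the "predefined full condition"
        self.buffers = [set(), set()]
        self.current = 0                  # step (b): index of current buffer
        self.disk_file = []               # sorted stand-in for the disk file

    def add(self, rep):
        """Step (e2): store rep in the current buffer if not present."""
        cur = self.buffers[self.current]
        if rep in cur:
            return False
        cur.add(rep)
        if len(cur) >= self.capacity:
            self._swap_and_merge()
        return True

    def _swap_and_merge(self):
        full = self.buffers[self.current]
        self.current = 1 - self.current   # step (f1): other buffer is current
        ordered = sorted(full)            # step (f2)
        merged = []
        # Step (f3): ordered merge of two sorted sequences, dropping repeats.
        for rep in heapq.merge(self.disk_file, ordered):
            if not merged or merged[-1] != rep:
                merged.append(rep)
        self.disk_file = merged
        full.clear()
```

Swapping before merging means address filtering never has to wait for the disk merge to finish.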
14. The method of claim 13, further comprising:
in step (e2), when the determination is negative, storing the identified address in the current buffer.

15. The method of claim 13, further comprising:
in step (e2), when the determination is negative, storing the identified address in a second disk file;
in step (e2), additionally storing with each representation in the current buffer a pointer to the corresponding address stored in the second disk file; and
in step (f2), while ordering the contents of the non-current buffer, keeping with each representation in the non-current buffer its pointer to the corresponding address in the second disk file.

16. The method of claim 15 wherein:
step (e2) comprises: for each representation in the buffer storing an associated flag, setting the flag to a first value when the representation is equal to a representation previously stored in the first disk file, and setting the flag to a second value, distinct from the first value, when the representation is not equal to any representation previously stored in the first disk file; and
step (f) includes: (f4) for each representation whose flag is set to the second value, scheduling the corresponding data set for downloading.

17. The method of claim 13 wherein step (e2) further comprises:
when a representation in the current buffer is not found in the first disk file during merging, scheduling the corresponding data set for downloading.



18. The method of claim 15 wherein:
step (a), storing representations of data set addresses, includes storing representations of data set addresses in a sparse disk file which is divided into portions, each portion having a starting address and contents comprising an ordered list of representations of data addresses; and
step (e2), performing an ordered merge of the contents of the current buffer into the contents of the sparse disk file, comprises the following steps:
for each of a plurality of the representations stored in the current buffer:
(e2-1) obtaining a starting address for a corresponding portion of the sparse disk file; and
(e2-2) performing an ordered merge of a subset of the current buffer, starting at the representation for which the starting address was obtained, into the contents of the corresponding portion.

19. The method of claim 3 wherein:
step (a), storing representations of data set addresses, includes the step of storing representations of data set addresses in a sparse disk file having empty entries interspersed among entries storing said representations; and
step (e2), merging the contents of the buffer with the ordered contents of the sparse disk file, includes:
for each respective representation stored in the buffer:
(e2-1) determining a starting address for a corresponding portion of the sparse disk file; and
(e2-2) sequentially scanning the disk file, starting at the representation for which the starting address was obtained, until the first of (A) a representation matching the respective representation is found and (B) one of the empty entries is found, and when an empty entry is found storing the respective representation in the empty entry.

20. The method of claim 13 wherein, in step (e1), the representation comprises a checksum of at least a portion of the identified address.

21. The method of claim 13 wherein step (e2) further comprises:
(e2-1) determining whether the representation is stored in a cache before determining whether the representation is stored in the current buffer;
(e2-2) when the representation is not stored in the cache, and the cache has not reached a predefined full condition, adding the representation to the cache; and
(e2-3) when the representation is not stored in the cache, and the cache has reached said predefined full condition, evicting a stored representation from the cache in accordance with an eviction policy and adding the representation to the cache.

22. A method of downloading data sets from among a plurality of host computers, comprising the steps of:
(a) storing representations of data set addresses in a set of data structures, including a buffer and a disk file, wherein representations of data set addresses stored in the disk file are ordered;
(b) downloading at least one data set that includes an address of a referred data set;
(c) identifying the address of the referred data set;
(d) generating a representation of the identified address;
(e) determining whether the representation is stored in the buffer, and whether the disk file is empty;
(f) when the representation is not stored in the buffer and the disk file is empty, scheduling the corresponding data set for downloading; and
(g) when the representation is not stored in the buffer and the disk file is not empty, storing the representation in the buffer and delaying scheduling of the corresponding data set for downloading until it is determined that the representation has not been previously stored in the disk file.
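The decision in steps (e) to (g) of claim 22 might be sketched in Python as below. The function name, the string return values and the `schedule` callback are hypothetical. The sketch also stores the representation in the buffer in the empty-file case, which follows the wording of claim 30 rather than step (f) itself and is therefore an assumption.

```python
def handle_representation(rep, buffer, disk_file_is_empty, schedule):
    """Decide whether to schedule a download now or defer the decision."""
    if rep in buffer:                     # step (e): already seen recently
        return "duplicate"
    buffer.add(rep)                       # assumption in the empty-file case
    if disk_file_is_empty:
        # Step (f): nothing on disk can match, so the address must be new.
        schedule(rep)
        return "scheduled"
    # Step (g): defer until the next merge shows whether rep is on disk.
    return "deferred"
```

While the disk file is still empty, a buffer miss is enough to prove an address is new, so the download need not wait for the first merge.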






23. A computer program product for use in conjunction with a computer system, the computer program product comprising a computer readable storage medium and a computer program mechanism embedded therein, the computer program mechanism comprising:
a first disk file and a buffer, for storing representations of data set addresses;
a main web crawler module for downloading and processing data sets stored on a plurality of host computers, the main web crawler module identifying addresses of the one or more referred data sets in the downloaded data sets; and
an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for:
generating a representation of the identified address;
determining whether the representation is stored in the buffer, and when this determination is negative storing the representation in the buffer; and
determining whether the buffer has reached a predefined full condition, and when this determination is positive, ordering the contents of the buffer and then performing an ordered merge of contents of the buffer into the contents of the first disk file.

24. The computer program product of claim 23, wherein the address filtering module further includes instructions for storing the identified address in the buffer after determining that the representation is not stored in the buffer.

25. The computer program product of claim 23, wherein the address filtering module further includes instructions for:
storing the identified address in a second disk file after determining that the representation is not stored in the buffer; and
storing with each representation in the buffer a pointer to the corresponding address stored in the second disk file; and
during the ordering of the contents of the buffer, keeping with each representation in the buffer its pointer to the corresponding address in the second disk file.

26. The computer program product of claim 23, wherein 

the first disk file is a sparse disk file divided into portions, each portion having a
starting address and contents comprising an ordered list of representations of data addresses;
and

the address filtering module includes instructions for performing the ordered merge of
the ordered contents of the buffer with the contents of the sparse disk file by obtaining a
starting address for a sub-file of the sparse disk file, the portion corresponding to one of the
representations in the buffer, and performing an ordered merge of a subset of the
representations in the buffer, starting at the one representation, into the contents of the
portion.

27. The computer program product of claim 23, wherein 

the first disk file is a sparse disk file having empty entries interspersed among entries 
storing said representations of data addresses; and 

the address filtering module includes instructions for performing the ordered merge of
the ordered contents of the buffer with the contents of the sparse disk file by obtaining a
starting address corresponding to each respective representations in the buffer, and
sequentially scanning the first disk file, starting at the starting address, until the first of (A) a
representation matching the respective representation is found and (B) one of the empty
entries is found, and when an empty entry is found storing the respective representation in the
empty entry.

28. The computer program product of claim 23 wherein the representation of the
identified address comprises a checksum of at least a portion of the identified address.

29. The computer program product of claim 23, wherein the address filtering module
further includes instructions for first determining whether the representation is stored in a
cache, and when the first determination is positive, skipping the determination of whether the
representation is stored in the buffer.

30. The computer program product of claim 23, wherein the address filtering module 
further includes instructions for:

determining whether the first disk file is empty and whether the representation is
stored in the buffer; and

if the first disk file is empty and the representation is not stored in the buffer, storing
the representation in the buffer and scheduling the corresponding data set for downloading.



31. A computer program product for use in conjunction with a computer system, the
computer program product comprising a computer readable storage medium and a computer
program mechanism embedded therein, the computer program mechanism comprising:

a first disk file, a first buffer, and a second buffer, for storing representations of data
set addresses;

a main web crawler module for downloading and processing data sets stored on a
plurality of host computers, the main web crawler module identifying addresses of the one or
more referred data sets in the downloaded data sets; and

an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for:

identifying one of the first and second buffers as a current buffer;
generating a representation of the identified address;
determining whether the representation is stored in the current buffer, and
when this determination is negative, storing the representation in the current buffer; and

determining whether the current buffer has reached a predefined full condition,
and when this determination is positive, selecting the other buffer as the current buffer,
wherein the previously current buffer is identified as a non-current buffer, ordering the
contents of the non-current buffer and then performing an ordered merge of the contents of
the non-current buffer into the contents of the first disk file.

32. The computer program product of claim 31, wherein the address filtering module
further includes instructions for storing the identified address in the current buffer after
determining that the representation is not stored in the current buffer.

33. The computer program product of claim 31, wherein the address filtering module
further includes instructions for:

storing the identified address in a second disk file after determining that the
representation is not stored in the current buffer;

storing with each representation in the current buffer a pointer to the corresponding
address stored in the second disk file; and

during the ordering of the contents of the non-current buffer, keeping with each
representation in the non-current buffer its pointer to the corresponding address in the second
disk file.

34. The computer program product of claim 31, wherein

the first disk file is a sparse disk file divided into sub-files, each sub-file having a
starting address and contents comprising an ordered list of representations of data addresses;
and

the instructions for performing the ordered merge including instructions for obtaining
a starting address for a sub-file of the first disk file, the sub-file corresponding to one of the
representations in the buffer, and performing an ordered merge of a subset of the
representations in the non-current buffer, starting at the one representation, into the contents
of the sub-file.

35. The computer program product of claim 31, wherein

the first disk file is a sparse disk file having empty entries interspersed among entries
storing said representations of data addresses; and

the address filtering module includes instructions for performing the ordered merge of
the ordered contents of the buffer with the contents of the sparse disk file by obtaining a
starting address corresponding to each respective representations in the buffer, and
sequentially scanning the first disk file, starting at the starting address, until the first of (A) a
representation matching the respective representation is found and (B) one of the empty
entries is found, and when an empty entry is found storing the respective representation in the
empty entry.



36. The computer program product of claim 31 wherein the representation of the
identified address comprises a checksum of at least a portion of the identified address.

37. The computer program product of claim 31, wherein the address filtering module
further includes instructions for:

determining whether the first disk file is empty and whether the representation is
stored in the current buffer; and

if the first disk file is empty and the representation is not stored in the current buffer,
storing the representation in the current buffer and scheduling the corresponding data set for
downloading.



38. A web crawler for downloading data set addresses from among a plurality of host
computers, comprising:

a first disk file and a buffer, for storing representations of data set addresses;

a main web crawler module for downloading and processing data sets stored on a
plurality of host computers, the main web crawler module identifying addresses of the one or
more referred data sets in the downloaded data sets; and

an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for:

generating a representation of the identified address;

determining whether the representation is stored in the buffer, and when this
determination is negative storing the representation in the buffer; and

determining whether the buffer has reached a predefined full condition, and
when this determination is positive, ordering the contents of the buffer and then performing
an ordered merge of the contents of the buffer into the contents of the first disk file.

39. The web crawler of claim 38, wherein the address filtering module further includes
instructions for storing the identified address in the buffer following a determination that the
representation is not stored in the buffer.

40. The web crawler of claim 38, wherein the address filtering module further includes
instructions for:

storing the identified address in a second disk file after determining that the
representation is not stored in the buffer; and

storing with each representation in the buffer a pointer to the corresponding address
stored in the second disk file; and

during the ordering of the contents of the buffer, keeping with each representation in
the buffer its pointer to the corresponding address in the second disk file.

41. The web crawler of claim 38 wherein

the first disk file is a sparse disk file divided into portions, each portion having a
starting address and contents comprising an ordered list of representations of data addresses;
and

the address filtering module further includes instructions for:

obtaining, from an index, a starting address for a portion in the sparse disk file
corresponding to one of the representations stored in the buffer; and

performing an ordered merge of a subset of the representations stored in the
buffer, starting at the representation for which the starting address was obtained, into the
contents of the corresponding portion.

42. The web crawler of claim 38 wherein 

the first disk file is a sparse disk file having empty entries interspersed among entries
storing said representations of data addresses; and

the address filtering module includes instructions for performing the ordered merge of
the ordered contents of the buffer with the contents of the sparse disk file by obtaining a
starting address corresponding to each respective representations in the buffer, and
sequentially scanning the first disk file, starting at the starting address, until the first of (A) a
representation matching the respective representation is found and (B) one of the empty
entries is found, and when an empty entry is found storing the respective representation in the
empty entry.

43. The web crawler of claim 38 wherein the representation of the identified address
comprises a checksum of at least a portion of the identified address.



44. The web crawler of claim 38 wherein the address filtering module further includes
instructions for:

determining whether the representation is stored in a cache before determining
whether the representation is stored in the buffer, and when this determination is negative,
determining whether the representation is stored in the buffer;

when the second determination is negative, storing the representation in the buffer;

when the first determination is negative, and predefined other criteria are met, storing
the representation in the cache; and

when the cache has reached a predefined full condition, evicting a stored
representation from the cache in accordance with an eviction policy.

45. The web crawler of claim 38 wherein the address filtering module further includes
instructions for determining whether the first disk file is empty and whether the
representation is stored in the buffer, and if the first disk file is empty and the representation
is not stored in the buffer, storing the representation in the buffer and scheduling the
corresponding data set for downloading.

46. A web crawler for downloading data set addresses from among a plurality of host
computers, comprising:

a first disk file, a first buffer and a second buffer, for storing representations of data
set addresses;

a main web crawler module for downloading and processing data sets stored on a
plurality of host computers, the main web crawler module identifying addresses of the one or
more referred data sets in the downloaded data sets; and

an address filtering module for processing a specified one of the identified addresses;
the address filtering module including instructions for:

identifying one of the first and second buffers as a current buffer;

generating a representation of the identified address;

determining whether the representation is stored in the current buffer, and
when this determination is negative, storing the representation in the current buffer; and

determining whether the current buffer has reached a predefined full condition,
and when this determination is positive, selecting the other buffer as the current buffer,
wherein the previously current buffer is identified as a non-current buffer, ordering the
contents of the non-current buffer and then performing an ordered merge of the contents of
the non-current buffer into the contents of the first disk file.

47. The web crawler of claim 46, wherein the address filtering module further includes 
instructions for storing the identified address in the current buffer after determining that the 
representation is not stored in the current buffer.

48. The web crawler of claim 46, wherein the address filtering module further includes
instructions for:

storing the identified address in a second disk file after determining that the
representation is not stored in the current buffer;

storing with each representation in the current buffer a pointer to the corresponding
address stored in the second disk file; and

during the ordering of the contents of the non-current buffer, keeping with each
representation in the non-current buffer its pointer to the corresponding address in the second
disk file.



49. The web crawler of claim 46, wherein

the first disk file is a sparse disk file divided into sub-files, each sub-file having a
starting address and contents comprising an ordered list of representations of data addresses;
and

the instructions for performing the ordered merge including instructions for obtaining
a starting address for a sub-file of the first disk file, the sub-file corresponding to one of the
representations in the buffer, and performing an ordered merge of a subset of the
representations in the non-current buffer, starting at the one representation, into the contents
of the sub-file.



50. The web crawler of claim 46 wherein

the first disk file is a sparse disk file having empty entries interspersed among entries
storing said representations of data addresses; and

the address filtering module includes instructions for performing the ordered merge of
the ordered contents of the buffer with the contents of the sparse disk file by obtaining a
starting address corresponding to each respective representations in the buffer, and
sequentially scanning the first disk file, starting at the starting address, until the first of (A) a
representation matching the respective representation is found and (B) one of the empty
entries is found, and when an empty entry is found storing the respective representation in the
empty entry.



51. The web crawler of claim 46 wherein the representation of the identified address 
comprises a checksum of at least a portion of the identified address. 



52. The web crawler of claim 46, wherein the address filtering module further includes 
instructions for: 

determining whether the first disk file is empty and whether the representation is
stored in the current buffer; and

when the first disk file is empty and the representation is not stored in the current
buffer, storing the representation in the current buffer and scheduling the corresponding data
set for downloading.